Improving Optimality of Neural Rewards Regression for Data-Efficient Batch Near-Optimal Policy Identification

نویسندگان

Daniel Schneegaß

Steffen Udluft

Thomas Martinetz

چکیده

In this paper we present two substantial extensions of Neural Rewards Regression (NRR) [1]. In order to give a less biased estimator of the Bellman Residual and to facilitate the regression character of NRR, we incorporate an improved, Auxiliared Bellman Residual [2] and provide, to the best of our knowledge, the first Neural Network based implementation of the novel Bellman Residual minimisation technique. Furthermore, we extend NRR to Policy Gradient Neural Rewards Regression (PGNRR), where the strategy is directly encoded by a policy network. PGNRR profits from both the data-efficiency of the Rewards Regression approach and the directness of policy search methods. PGNRR further overcomes a crucial drawback of NRR as it extends the accordant problem class considerably by the applicability of continuous action spaces.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Explicit Kernel Rewards Regression for data-efficient near-optimal policy identification

We present the Explicit Kernel Rewards Regression (EKRR) approach, as an extension of Kernel Rewards Regression (KRR), for Optimal Policy Identification in Reinforcement Learning. The method uses the Structural Risk Minimisation paradigm to achieve a high generalisation capability. This explicit version of KRR offers at least two important advantages. On the one hand, finding a near-optimal pol...

متن کامل

Neural Rewards Regression for near-optimal policy identification in Markovian and partial observable environments

Neural Rewards Regression (NRR) is a generalisation of Temporal Difference Learning (TD-Learning) and Approximate Q-Iteration with Neural Networks. The method allows to trade between these two techniques as well as between approaching the fixed point of the Bellman iteration and minimising the Bellman residual. NRR explicitly finds a near-optimal Q-function without an algorithmic framework exce...

متن کامل

Batch Policy Gradient Methods for Improving Seq2seq Conversation Models

We study reinforcement learning of chat-bots with recurrent neural network architectures when the rewards are noisy and expensive to obtain. For instance, a chat-bot used in automated customer service support can be scored by quality assurance agents, but this process can be expensive, time consuming and noisy. Previous reinforcement learning work for natural language processing uses onpolicy u...

متن کامل

Indexability of Restless Bandit Problems and Optimality of Index Policies for Dynamic Multichannel Access

We consider an opportunistic communication system consisting of multiple independent channels with time-varying states. With limited sensing, a user can only sense and access a subset of channels and accrue rewards determined by the states of the sensed channels. We formulate the problem of optimal sequential channel selection as a restless multi-armed bandit process. We establish the indexabil...

متن کامل

Artificial intelligence-based approaches for multi-station modelling of dissolve oxygen in river

ABSTRACT: In this study, adaptive neuro-fuzzy inference system, and feed forward neural network as two artificial intelligence-based models along with conventional multiple linear regression model were used to predict the multi-station modelling of dissolve oxygen concentration at the downstream of Mathura City in India. The data used are dissolved oxygen, pH, biological oxygen demand and water...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2007

Improving Optimality of Neural Rewards Regression for Data-Efficient Batch Near-Optimal Policy Identification

نویسندگان

چکیده

منابع مشابه

Explicit Kernel Rewards Regression for data-efficient near-optimal policy identification

Neural Rewards Regression for near-optimal policy identification in Markovian and partial observable environments

Batch Policy Gradient Methods for Improving Seq2seq Conversation Models

Indexability of Restless Bandit Problems and Optimality of Index Policies for Dynamic Multichannel Access

Artificial intelligence-based approaches for multi-station modelling of dissolve oxygen in river

عنوان ژورنال:

اشتراک گذاری